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Abstract.  We  report  on  experiments  with  a  mobile  robot  using  one  vision  process  (forward 
motion  vision)  to  calibrate  another  (stereo  vision)  without  resorting  to  any  external  units 
of  measurement.  Both  are  calibrated  to  a  velocity  dependent  coordinate  system  which  is 
natural  to  the  task  of  obstacle  avoidance.  The  foundations  of  these  algorithms,  in  a  world 
of  perfect  measurement,  are  quite  elementary.  The  contribution  of  this  work  is  to  make 
them  noise  tolerant  while  remaining  simple  computationally.  Both  the  algorithms  and  the 
calibration  procedure  are  easy  to  implement  and  have  shallow  computational  depth,  making 
them  (1)  run  at  reasonable  speed  on  moderate  uni-processors,  (2)  appear  practical  to  run 
continuously,  maintaining  an  up-to-the-second  calibration  on  a  mobile  robot,  and  (3)  appear 
to  be  good  candidates  for  massively  parallel  implementations. 
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1.  Introduction 

This  paper  is  about  vision  lor  a  mobile  robot.  It  is  about  how  to  build  a  computationally 
cheap  robust  vision  system  which  delivers  data  for  obstacle  avoidance.  The  vision  system  is 
continually  self  calibrating,  making  it  tolerant  of  normal  mechanical  drift.  But  better  than 
that,  it  is  tolerant  of  severe  blows  to  its  head-like  sensor  platforms.  After  a  few  seconds  in 
a  trauma-induced  daze  it  adapts  to  its  grossly  altered  sensor  alignments. 

But  wait,  there’s  more!  The  algorithms  require  no  pre-knowledge  of  camera  focal  lengths, 
fine  orientation,  or  stereo  baseline  separation.  With  such  quick  calibration  and  adaption 
the  sensors  can  be  mounted  on  cheap  steerable  systems.  We  can  trade  cheap  computation 
for  deficiencies  which  arise  from  avoiding  expensive  mechanical  solutions  to  sensor  steering 
problems. 

We  begin  with  three  observations: 

•  Humans  are  able  to  extract  meanginful  information  from  images  without  being  aware  of 
the  camera  geometry  (e.g.  baseline  separation  in  stereo)  or  the  focal  parameters  of  the 
imaging  system.  They  can  adapt  almost  instantly  to  TV  or  movie  images  made  with 
unknown  optics  and  relatively  quickly  to  disturbances  in  the  optical  pathway  to  their 
retinas.  For  instance  a  change  in  stereo  baseline  separation  induced  by  special  glasses 
can  be  adapted  for  in  a  matter  of  seconds.  In  contrast,  most  mobile  robots  today  require 
precisely  understood  optics  and  many  seconds  of  intense  computation  at  the  begining  of 
an  experimental  run  (Faugeras  and  Toscani  (1986))  to  accommodate  small  mechanical 
and  electronic  drifts  in  their  systems. 

•  Marr  (1982)  points  out  that  the  purpose  of  vision  depends  on  the  task  the  perceiving 
organism  is  trying  to  achieve.  Brooks  (1986)  demonstrates  that  data  from  a  single  sensor 
system  can  be  used  in  entirely  independent  channels  (including  independent  perception 
systems)  to  control  different  aspects  of  the  behavior  of  a  mobile  robot.  Thus  there  need 
not  be  just  a  single  purpose  of  vision.  A  useful  engineering  consideration  in  building 
a  perception  system,  then,  is  to  analyze  the  requirements  in  terms  of  the  task  to  be 
achieved  using  the  output  of  the  perception  system. 

•  Much  of  human  and  animal  vision  is  extremely  fast,  taking  place  in  small  fractions  of 
a  second.  However,  it  is  implemented  on  hardware  which  is  extremely  slow  (perhaps 
one  thousand  gate  delays  per  second)  as  compared  with  today’s  computer  hardware 
(ten  to  one  hundred  million  or  more  gate  delays  per  second).  Nonetheless  biological 
vision  is  vastly  superior  to  current  computer  implementations.  Biological  algorithms 
must  therefore  be  computationally  shallow  in  order  to  fit  on  the  available  hardware. 

In  this  paper  we  explore  some  visual  techniques  useful  for  a  mobile  robot  navigating  in 
indoor  environments.  The  particular  task  these  techniques  support  is  avoiding  obstacles. 
Brooks  (1986)  shows  how  to  separate  this  particular  task  from  others  that  a  mobile  robot 
may  concurrently  be  pursuing. 

Noting  the  earlier  observations,  we  are  interested  in  finding  self  calibrating  algorithms 
which  are  tolerant  of  large  drifts  in  optical  properties  of  the  imaging  system,  in  finding 
algorithms  which  have  shallow  computation  depth,  and  in  finding  algorithms  that  are  well 
suited  to  the  obstacle  aviodance  task. 

The  experiments  we  describe  in  this  paper  have  been  performed  using  Allen  the  robot 
(Brooks  (1987))  as  a  sensor  platform  and  an  offboard  Lisp  machine  as  the  computational 


Figure  1.  Two  cameras  are  mounted  on  Allen  using  standard  tripod  mounts.  With  such  mounts 
it  is  certain  that  the  cameras  will  not  be  parallel.  There  is  also  mechanical  misalignment  between 
the  camera  mounts  and  the  drive  mechanism  of  the  robot.  A  more  expensive  mechanical  system, 
and  a  more  complex  camera  mounting  system  could  overcome  these  limitations,  but  the  system 
would  still  be  susceptible  to  misalignment  through  minor  mechanical  damage  If  we  don’t  demand 
mechanical  alignment  we  won’t  be  perturbed  by  normal  wear  and  tear 

engine.  Tim  robot  has  two  approximately  forward-looking  CCD  cameras.  A  standard 
camera  mount  is  used  to  attach  them  to  a  tilt  head,  so  there  is  considerable  risk  of  not 
having  the  cameras  pointing  directly  forward.  Figure  1  shows  the  cameras  mounted  and  on 
the  robot. 

/./  Forward  motion  and  .steno 

Our  primary  algorithm  analyzes  forward  motion  by  tracking  strong  vertical  <  dges  over  many 
consecutive  images.  As  Holies,  Baker  and  Marimont  ( 1  7 )  have  pointed  out,  one  avoids 

the  problem  of  solving  for  corresponding  points  by  taking  closely  spaced  images. 

Unlike  Holies.  Baker  and  Marimont,  however,  we  are  addressing  the  problem  of  forward 
rather  than  lateral  motion,  and  we  do  not  assume  knowledge  of  cam*  ra  motion  or  of  camera 
angle  relative  to  motion. 

Forward  motion  analysis  relies  on  straight  line  motion  of  the  ohot  at  constant,  but 
unknown,  velocity.  Imt  does  not  require  that  the  camera  point  dir««-i|y  ahead,  nor  that  it 
point  at  a  known  angle  relative  to  the  motion.  There  are  st  rong  con  -t  mints  on  the  possible 
motion  of  features  if  the  robot  motion  constraints  are  met.  so  'i  is  (  Hen  possible  to  detect 
when  the  motion  constraints  are  being  violated. 


The  particular  way  in  which  forward  motion  analysis  supports  the  obstacle  avoidance  task 
is  to  deliver  distances  to  obstacles  in  units  of  time  to  collision  at  the  current  velocity.  This 
is  the  perfect  coordinate  system  for  obstacle  avoidance.  Steering  commands  and  velocity 
change  commands  can  be  given  without  the  need  to  convert  from  an  absolute  coordinate 
system  to  desired  velocities  for  the  robot. 

Forward  motion  analysis  has  some  surprising  independence  properties; 

•  the  details  of  the  focal  geometry  of  a  perspective  camera  are  unimportant, 

•  the  camera  need  not  be  pointing  directly  in  the  direction  of  motion  (there  is  a  sim¬ 
ple  procedure  which  recovers  the  orientation  of  the  camera  relative  to  the  direction  of 
motion), 

•  and  it  naturally  delivers  the  distance  to  the  physical  artifacts  giving  rise  to  tracked  fea¬ 
tures  as  the  time  to  collision  with  that  artifact  in  units  of  the  inter-frame  time  intervals. 

These  properties  make  forward  motion  analysis  ideal  as  an  input  to  an  obstacle  avoidance 
task  on  the  robot.  Additionally  it  is  easily  implementable  on  hardware  which  supports  only 
16  bit  arithmetic. 

Unfortunately  forward  motion  analysis  has  some  drawbacks  also; 

•  it  doesn’t  deliver  distances  to  obstacles  which  are  straight  ahead  along  the  direction  of 
motion  (and  clearly  these  are  some  of  the  most  important  obstacles  to  consider), 

•  and  it  is  useless  when  the  robot  is  stationary  or  turning  in  place.  Thus  when  the  robot 
stops  to  turn,  it  is  completely  blind  in  its  new  direction  of  motion  until  it  has  moved  in 
that  direction  for  a  period  of  time.  In  a  cluttered  environment  it  may  be  important  to 
be  able  to  plan  an  obstacle  free  path  before  starting  to  move. 

To  compensate  for  these  shortcomings  we  use  a  second  early  vision  algorithm  whose  prop¬ 
erties  exactly  complement  those  of  forward  motion  analysis.  We  use  a  single  scanline  stereo 
algorithm.  Stereo  provides  depth  measurements 

•  straight  ahead, 

•  and  when  the  robot  is  stationary. 

But  stereo  too  has  a  serious  drawback.  Even  rough  depth  measurements  rely  on  a  knowledge 
of  camera  focal  geometry  and  the  six  parameters  relating  the  geometries  of  the  two  cam¬ 
eras.  Alignment  errors  as  small  as  ±5°  for  each  camera  can  lead  to  wildly  incorrect  depth 
estimates,  including  negative  depths  and  an  order  of  magnitude  wrong  positive  depths. 

The  main  result  of  this  paper  is  a  simple  method  of  continously  calibrating  the  stereo 
system  to  reliably  deliver  depths  in  the  same  units  as  the  forward  motion  analysis  algorithms. 

1.2  Experimental  assumptions 

Man-made  indoor  environments  give  rise  to  images  with  strong  vertical  edges.  We  can  make 
use  of  this  fact  in  designing  our  algorithms. 

The  forward  motion  and  stereo  algorithms  both  assume  one  dimensional  cameras  with 
the  plane  defined  by  the  image  line  and  the  optical  center  parallel  to  the  ground  (this  was 
also  assumed  by  Bolles,  Baker  and  Marimont  (1987)).  In  the  current  experiments  we  use 
regular  video  CCD  cameras  and  average  a  swath  of  16  scanlines  from  the  middle  of  the 
image.  This  highlights  strong  vertical  edges.  Grey  levels  are  taken  to  8  bits  and  range  from 
0  (white)  to  255  (black). 


Two  cameras  were  mounted  side  by  side  facing  forward,  separated  by  approximately 
8  inches.  The  cameras  were  only  approximately  aligned  in  the  forward  direction.  Our 
analysis  below  allows  up  to  a  ±5°  misalignment  of  each  camera.  It  assumes  the  cameras 
are  restricted  to  a  field  of  view  of  GO0.  Knowing  the  field  of  view  lets  us  compute  the  focal 
ratio  of  the  camera.  The  analysis  assumes  the  two  cameras  are  identical  in  this  regard  but 
is  easily  generalized  to  two  different  but  known  cameras. 

In  our  experiments  we  used  cameras  and  lens  systems  with  approximately  a  60°  field 
of  view.  There  is  only  one  place,  and  that  is  in  the  forward  motion  algorithm,  where 
knowledge  of  the  field  of  view  and  hence  focal  ratio  is  used.  In  section  2  we  show  that  in 
order  to  relate  the  depth  estimates,  for  stereo  calibration,  obtained  from  two  cameras  we 
must  include  a  small  error  term  to  correct  for  their  misalignment.  It  is  only  in  this  error 
term  that  the  field  of  view  is  needed.  We  show  that  these  errors  can  be  at  most  ±5%  so 
only  an  approximate  knowledge  of  the  field  of  view  is  necessary.  For  a  robot  which  can 
turn  in  place  and  which  has  appropriate  encoders  to  measure  its  turn,  the  camera  field  of 
view  could  easily  be  calibrated  each  time  it  turns  by  tracking  image  points  over  a  signifeant 
portion  of  the  image.  We  did  not  so  estimate  the  field  of  view  in  these  experiments  but 
assumed  it  a  priori. 

All  computation  was  done  offboard  and  offline  on  a  Symbolics  lisp  machine.  Total 
computation  time  for  the  reported  experiments  is  on  the  order  of  a  few  seconds  running 
unoptimized  lisp  code. 

We  plan  on  porting  these  algorithms  to  a  new  robot  with  all  onboard  computation. 
On  that  robot  we  plan  on  using  cylindrical  lenses  and  single  scan  line  horizontal  CCD 
cameras.  The  lenses  will  optically  average  a  horizontal  swath  of  a  more  conventional  image 
highlighting  strong  vertical  edges. 

For  the  task  of  obstacle  avoidance  we  estimate  that  knowledge  of  the  distance  to  obstacles 
to  within  ±10%  is  quite  adequate. 

All  our  experiments  described  in  this  paper  give  distances  in  terms  of  the  robot’s  current 
speed.  We  did  not  try  to  measure  that  quantity.  Our  evaluation  of  our  experiments  cannot 
therefore  be  based  on  determinations  of  absolute  accuracy.  Rather,  we  compare  computed 
relative  distances  to  pairs  of  tracked  objects  as  a  measure  of  accuracy,  as  in  an  ideal  exper¬ 
iment  all  such  relative  distances  should  remain  constant.  On  this  basis  we  believe  we  have 
achieved  ±10%  accuracy  with  very  computationally  simple  and  shallow  depth  algorithms. 

2.  Forward  Motion  Vision 

In  this  section  we  first  derive  the  equations  of  forward  motion  vision  without  regard  to 
sensitivity  to  noise  and  sampling  errors.  We  then  describe  some  practical  algorithms  to 
overcome  problems  due  to  noise  and  sampling  errors. 

2. 1  One  dimensional  camera  geometry 

Consider  figure  3.  It  is  a  view  from  above  of  a  camera.  The  image  plane  is  a  one  dimensional 
strip,  i.e.,  the  camera  provides  only  a  1-D  image. 

The  image  plane  has  P  pixels,  numbered  0  through  P  —  1.  The  optical  center  of  the 
camera  is  distance  /,  measured  in  pixels,  from  the  image  plane.  We  call  /  the  focal  ratio. 


The  optical  center  plant  is  parallel  to  the  image  plane  and  runs  through  the  optical  center. 
A  lino,  the  optical  ai'is.  runs  through  the  optical  center  and  intersects  the  image  plane  at 
right  angles.  'I' he  point  of  intersection  divides  the  image  plane  in  two  equal  parts  and  is 
called  the  center  of  vit  tr. 


Points  in  the  world  are  projected  onto  the  image  plane  along  lines  that  run  through  the 
optical  renter.  The  fh  Id  of  view  of  the  camera  is  <j>,  the  maximum  angle  subtended  by  any 
two  visible  points.  Tim  focal  ratio  and  field  of  view  are  related  by: 


2  tan (0/2) 


I  his  gives  /  in  units  of  pixels. 


On  the  camera:.  w<  u.e<|  111  mu  experiments,  which  no-  ,ri7(i  pixels  wide  with  approxi 
inatolv  a  (10°  field  of  view.  /  is  approximately  109  pixels. 
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Figure  3.  A  one  dimensional  camera  has  an  image  plane  /  pixel  units  from  the  optical  center  of 
the  camera.  The  optical  axis  is  perpendicular  to  the  image  plane,  which  is  P  pixels  wide. 


2.2  Forward  motion  vision  geometry 

Now  consider  figure  4.  It  is  a  view  from  above  of  a  camera  whose  optical  axis  is  offset 
somewhat  from  the  direction  of  motion  of  the  camera.  The  camera  is  undergoing  pure 
translation  in  a  constant  direction  of  motion.  The  motion  direction  vector,  the  optical  axis 
and  the  image  strip  on  the  image  plane  are  all  coplanar. 

This  model  corresponds  to  a  1-D  camera  mounted  parallel  to  the  ground  plane  on  a 
mobile  robot  which  is  travelling  forward  without  turning.  Typically  one  would  mount  a 
cylindrical  lens  on  such  a  camera  to  enhance  the  strong  vertical  edges  found  in  man-made 
indoor  environments.  Without  careful  alignment  of  the  camera  it  will  not  point  directly  in 
the  direction  of  motion  of  the  robot.  Even  if  the  camera  is  rigidly  mounted  on  the  body 
of  the  robot,  the  direction  the  robot  travels  with  respect  to  its  body  will  vary  gradually 
over  time  due  to  tire  wear  and  other  mechanical  processes.  On  our  robot  Allen  we  notice 
that  the  body  and  drive  base  of  the  robot  can  easily  be  misaligned  by  a  few  d>  -  '•••■s.  In 
this  section  we  show  how  to  calibrate  dynamically  for  these  effects  with  a  few  arithmetic 
operations  over  a  few  images  taken  as  the  robot  is  moving  at  constant  velocity. 

The  trajectories  of  image  features  under  the  geometry  of  figure  4  are  described  by  Bolles, 
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Figure  4.  As  a  camera  moves  in  a  straight  line,  points  to  the  left  of  the  direction  of  motion  of  the 
camera  appear  to  move  to  the  left  and  those  to  the  right  appear  to  move  to  the  right. 


Baker  and  Marimont  (1987).  Here  we  derive  equations  yielding  the  time-to-collision  and 
the  center  of  expansion  under  this  geometry. 

The  image  of  any  point  pr  which  is  to  the  right  of  the  direction  of  motion  will  move 
rightwards  in  the  image  plane.  The  image  of  any  point  pi  which  is  to  the  left  of  the 
direction  of  motion  will  move  leftwards  in  the  image  plane.  Points  directly  ahead  will  not 
move  at  all  in  the  image.  The  projection  of  such  points  is  called  the  center  of  expansion. 
We  write  Ce,  for  its  coordinate  in  pixels. 

In  an  operational  robot,  we  envision  an  estimator  for  Ce  running  continously  and  slowly 
and  incrementally  updating  the  estimate  as  it  drifts.  More  generally  what  we  would  like  to 
know,  for  the  purpose  of  controlling  a  robot,  is  how  long  it  will  be  before  the  robot  reaches 
some  visible  point  p.  The  forward  motion  analysis  described  here  determines  the  time  that 
will  elapse  for  the  robot  to  travel  forward  far  enough  that  the  plane,  parallel  to  the  image 
plane  and  coincident  with  the  optical  center,  intersects  point  p.  This  assumes  that  the 
robot’s  velocity  is  constant  but  not  that  it  is  known. 

Now  consider  figure  5,  which  illustrates  the  same  physical  setup  as  the  last  figure  but 
with  different  quantities  annotated.  The  camera  has  focal  ratio  /.  Call  the  distance  from 
the  optical  center  to  the  center  of  expansion  /'.  The  camera  is  misaligned  by  an  angle 
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Figure  5.  As  a  camera  moves  forward  in  some  direction  a  offset  from  the  optical  center  of  the 
camera,  we  are  interested  in  computing  the  distance  z  to  a  point  p. 


a  from  pointing  directly  in  the  direction  of  motion.  Suppose  the  camera  is  moving  with 
constant  velocity  v  and  consider  what  happens  to  the  image  of  a  point  p.  Its  physical 
location  can  be  described  by  two  non-orthogonal  coordinates:  x',  its  distance  parallel  to  the 
image  plane  from  the  path  taken  by  the  optical  center  of  the  camera,  and  z',  its  distance  in 
the  direction  of  motion  from  the  optical  center  plane.  Note  that  the  derivative  of  z'  with 
respect  to  time,  z'  is  the  constant  —v. 

From  similar  triangles  we  can  see  that  the  distance,  r,  along  the  image  plane,  of  the 
image  of  point  p  from  the  center  of  expansion  is  given  by: 


and  since  x'  is  constanl  its  derivative  with  respect  to  time  is: 

/V 

r  — - z  . 

iz')2 

We  can  then  observe  that: 


The  last  term  is  the  time  it  will  take  for  the  optical  center  plane  to  intersect  p.  Thus  r/f  is 
just  the  time  to  collision  of  the  optical  center  plane  and  the  physical  feature  whose  image 
is  being  observed. 

The  accuracy  of  forward  motion  analysis  is  very  sensitive  to  the  estimate  we  make  of 
the  center  of  expansion  since  r  is  measured  relative  to  it.  The  position  of  Ce  can  range 
over  a  third  of  the  image  when  camera  alignment  of  a  30°  field  of  view  camera  is  allowed 
to  vary  by  ±5°  (the  range  is  not  so  bad  for  larger  field  of  view  cameras,  but  the  quantity 
being  measured  is  correspondingly  smaller).  Its  value  clearly  impacts  greatly  the  estimates 
of  time  to  collision. 

With  perfect  measurement  it  turns  out  to  be  easy  to  compute  the  value  of  Ce  by  tracking 
a  single  point  over  a  small  time  interval.  Suppose  we  measure  the  image  position  of  a  point 
in  a  coordinate  frame  relative  to  the  left  end  of  the  image.  It  has  position  R,  say,  in  this 
coordinate  system.  Since  Ce  should  remain  constant  we  have  that 

r  =  R  —  Ce,  r  =  R 

so  the  time  to  collision  for  the  optical  center  plane  is 

;  =  o) 

r  R  J 

Suppose  we  can  measure  R  and  R  at  two  distinct  times  to  and  tj.  Then  the  estimate  to 
time  to  collision  should  decrease  by  precisely  t\  -  to-  Writing  our  estimates  as  functions  of 
time  we  then  get  that: 

R(to)  -  Ce  _  <t  _t\zz  -  Ce 
R(to)  1  R(h) 

whence 

R(to)R(ti)  —  R(ti)R(t0)  +  R(ti)R(to)(t\  —  to)  , 

R(i0)-R(h) 

In  summary  then  we  can  easily  (given  perfect  measurement)  estimate  Ce  and  therefore 
compute  the  time  to  collision  of  the  optical  center  plane 

-  =  (5) 

V  R 

However,  time  to  collision  from  the  optical  center  plane  to  the  point  p,  i.e.,  z' /v  is  not 
exactly  the  information  we  want  if  the  camera  has  been  misaligned.  In  order  to  calibrate  our 
stereo  system  in  section  3  of  this  paper  we  will  need  depth  estimates  in  a  common  coordinate 
system  for  the  two  cameras.  Therefore  we  really  want  z/v  where  z  is  the  component  of 
distance  from  the  center  of  the  camera  in  the  direction  of  motion  and 

z  =  z'  4-  x'  sin  a  (6) 

as  can  be  seen  from  figure  5. 

Let  c  —  Ce  ~  P/ 2,  the  distance  in  pixels  along  the  image  plane  from  the  center  of  view 
to  the  center  of  expansion,  then  referring  to  figure  5 


so  by  equation  (2) 


and  since 


we  have 


.  ex'  .  erz' 

z  =  z  +T  =  Z  +(7¥ 


/'  =  y/c*  +  P 


z  —  z' 


'  C*  +  P’  '  ' 

Notice  that  since  we  only  really  know  z'  fv  (from  equation  (5))  this  only  gives  us  z/u,  which 
in  any  case  was  what  we  wanted. 

It  is  interesting  to  ask  just  how  different  z  can  be  from  z'.  Since  a  is  restricted  to  ±5°, 
x1  is  almost  perpendicular  to  z',  so  for  a  camera  with  a  60°  field  of  view 

xf 

-  tan  30°  <  —  <  tan  30° 


and  since  tan  30°  x  sin  5°  =  0.0503  we  see  that  z  =  z'  +  I'sina  is  within  ±5%  of  z'.  Thus, 
since  equation  (7)  is  correcting  such  a  relatively  small  error  it  is  not  critical  to  know  / 
(which  can  be  derived  from  <j>  through  equation  (1))  particularly  accurately. 

We  now  know  z/u,  or  the  time  to  collision  with  the  obstacle,  very  accurately  using  only 
an  imprecise  approximation  to  the  field  of  view  of  the  camera  as  our  a  priori  knowledge 
(which  is  easily  gained  by  rotating  the  camera  a  roughly  known  amount).  Since  time  to 
collision  is  precisely  the  right  quantity  to  know  for  obstacle  aviodance  we  don’t  need  to 
try  to  compute  the  robot’s  velocity  v  at  all.  In  fact  we  don’t  even  need  to  know  how  our 
measure  of  passage  of  time  t  relates  to  external  units.  All  we  need  is  a  steady  on  board 
clock  and  we  can  do  obstacle  aviodance  visually. 


2.3  Tracking  image  features 

The  above  analysis  suggests  the  idea  of  detecting  image  features  and  tracking  them  over 
many  images. 

We  track  edges.  We  simply  convolve  the  image  with  a  derivative  of  a  Gaussian  (the 
actual  mask  is  1,  3,  5,  9,  14,  18,  20,  18,  11,  0,  -11,  -18,  -20,  -18,  -14,  -9,  -5,  -3,  -1) 
and  then  look  for  local  maximum  absolute  values  (Horn  (1986)).  We  threshold  the  edges 
based  on  their  strength  of  convolution  with  the  mask.  We  accept  only  edges  which  have  a 
convolution  value  greater  than  500.  We  do  not  attempt  sub-pixel  localization  of  edges  as 
our  robot  shakes  enough  as  it  rolls  foward  to  impose  at  least  ±1  pixel  error  on  top  of  any 
localization  noise  due  to  discrete  estimates. 

Figure  8  is  a  data  set  showing  edge  traces  from  a  run  with  the  mobile  robot  as  the  robot 
moves  forward.  It  is  a  2-D  binary  array  where  each  row  is  a  one  dimensional  image  and  time 
flows  downwards.  Array  elements  are  1  (black)  where  an  edge  was  detected  and  0  (white) 
otherwise.  Equation  (2)  and  the  constant  velocity  of  the  robot  tell  us  that  each  edge  trace 
is  a  hyperbola.  If  the  images  are  taken  sufficiently  close  together  (such  as  in  this  dataset) 
it  is  very  simple  to  track  edges  without  having  to  refer  to  the  original  grey  level  images. 

It  is  sufficient  to  buffer  only  two  rows  of  the  array  and  keep  track  of  the  direction  and 
velocity  of  an  edge  track  in  order  to  predict  a  small  search  window  and  direction  of  search 
in  the  second  row.  There  are  two  cases  in  searching  for  the  next  edge  in  a  trace;  at  the 
beginning  of  an  edge  trace  where  the  direction  within  the  image  may  not  be  known,  and 
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later  when  the  direction  is  known.  At  all  times  the  possibility  of  noise  in  the  edge  motion 
must  be  taken  into  account. 

When  the  direction  is  not  known  a  default  symmetric  window  (±3  pixels)  is  searched  in 
a  succeeding  row.  If  only  one  edge  element  is  found,  that  is  taken  to  be  the  next  element 
of  the  trace.  If  more  than  one  edge  is  found  the  current  trace  is  abandoned. 

When  the  direction  of  motion  (left  or  right)  is  known  we  keep  track  of  the  edge  velocity 
at  each  step  of  the  trace.  The  trace  is  predicted  to  move  the  same  amount  at  the  next  step 
plus  or  minus  a  small  window  of  margin  in  order  to  accomodate  noise  in  the  image  and 
edge  velocity  increases;  -1  pixel  short  of  prediction  through  +3  pixels  extra.  The  search 
proceeds  in  the  direction  of  motion,  and  the  first  edge  encountered  is  taken  as  the  correct 
edge.  If  no  edge  is  encountered  within  the  window  then  the  edge  trace  is  abandoned.  No 
attempt  is  made  to  hypothesize  a  missing  edge  in  the  given  one  dimensional  image  and  to 
try  to  look  in  the  appropriate  place  in  the  next  image. 

There  are  two  ways  in  which  the  direction  of  edge  motion  can  be  determined. 

After  a  few  elements  of  a  trace  have  been  tracked  the  direction  should  be  clear  by 
comparing  the  position  of  the  first  and  most  recent  pixels.  If  the  direction  is  not  clear  then 
the  trace  can  be  abandoned  as  there  is  no  obvious  depth  information  in  it. 

However,  most  edge  traces  have  only  one  possible  direction  even  considering  only  the 
first  row  in  which  it  appears.  If  Ce  is  known  then  edges  on  the  left  move  leftwards,  and 
edges  on  the  right  move  rightwards.  Even  when  Ce  is  not  known,  an  a  priori  knowledge  of 
/  and  the  maximum  permissible  range  of  a  determines  a  strip  within  the  image  where  it 
is  possible  that  Ce  might  fall,  and  edge  traces  starting  outside  that  strip  have  the  obvious 
direction. 

The  key  heuristic  used  by  the  above  edge  tracking  algorithm  to  handle  noise  is  perhaps 
not  obvious  at  first  glance.  In  fact  the  main  idea  is  to  abandon  an  edge  trace  when  conditions 
get  complicated.  Chances  are  that  the  disturbance  will  last  only  a  few  images  at  most  and 
the  trace  can  be  re-established  as  a  brand  new  trace  a  few  images  later. 

2-4  Noise  and  estimation 

The  analysis  of  section  2.1  assumes  no  noise  and  perfect  measurement  of  various  quantities. 
Real  image  sequences  suffer  from  digitization  effects  and  large  sources  of  noise  due  to  un¬ 
stable  camera  platforms,  unconstant  velocity  and  curved  rather  than  straight  line  motion. 
We  must  therefore  fit  all  our  measurements  over  many  images. 

To  estimate  the  center  of  expansion  we  need  to  know  R  and  R  at  more  than  one  place 
along  an  edge  trace.  To  decrease  the  effects  of  noise  it  is  clearly  best  to  use  more  than  one 
edge  trace.  We  use  every  edge  trace  and  continually  refine  our  estimate  of  Ce- 

Quantity  R  is  measured  directly  from  the  edge  pixel  array.  To  estimate  R,  the  edge 
velocity,  we  trade  off  localization  in  time  of  our  estimate  with  accuracy  of  the  estimate 
through  the  choice  of  a  constant  V.  In  all  our  experiments  reported  here  we  have  used 

V  =  4.  We  approximate  R  by  the  slope  of  a  chord  between  two  points  on  a  hyperbolic  trace 

V  images  apart.  By  Rolle’s  theorem  some  point  on  the  hyperbola  between  those  two  points 
has  exactly  that  slope.  We  arbitrarily  choose  the  midpoint  in  time.  A  larger  V  reduces  the 
magnitude  effects  of  noise  on  the  estimate  but  increases  our  localization  error.  We  see  that 
VI I  is  simply  the  distance  travelled  by  the  edge  trace  over  V  time  intervals. 
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Figure  6.  Image  strips  averaged  to  a  single  scan  line  and  ordered  in  time  for  100  images  from  the 
left  camera. 


Figure  7.  Image  strips  averaged  to  a  single  scan  line  and  ordered  in  time  for  100  images  from  the 
right  camera. 


Figure  8.  The  edges  detected  in  figure  6. 
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Figure  9.  The  edges  detected  in  figure  7. 


In  the  experiments  reported  in  this  paper  we  estimate  V  R  at  every  pair  of  points  on  a 
trace  separated  by  31^  steps  in  time,  and  use  them  to  estimate  Ce  using  equation  (4)  (by 
estimating  V R  rather  than  R,  only  one  division  is  necessary  for  each  estimate  of  Ce)-  1 
Our  estimates  over  all  time  and  all  traces  are  then  averaged  together.  We  choose  3V  so  that 
R  will  have  had  time  to  change  significantly  and  render  the  estimate  for  Ce  more  stable. 


Thus  cadi  estimate  relies  on  four  points  along  the  edge  trace.  Horn  (1987)  has  suggested  an 
iterative  scheme  for  solving  for  a  least  squares  fit  along  a  complete  trace. 
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A  more  sophisticated  algorithm  could  dynamically  choose  the  step  size  for  the  comparison 
based  on  the  edge  velocity. 

Once  Ce  has  been  determined  it  is  easy  to  compute  the  time  to  collision  for  a  given  edge 
point  in  a  given  edge  using  the  same  estimate  for  R  as  above.  However  since  we  are  tracking 
edges  and  since  edges  correspond  (hopefully)  to  a  single  three  space  feature,  the  time  to 
collision  should  be  reducing  by  one  every  image.  We  can  therefore  smooth  our  estimate  for 
the  time  to  collision  by  fitting  a  line  of  slope  -1  over  S  images  steps  (i.e.,  5+1  images). 
A  least  squares  fit  of  such  a  line  amounts  to  averaging  the  time  to  collision  estimate  along 
a  trace  over  a  set  of  5  +  1  consecuctive  images. 

Thus  our  estimates  for  time  to  collision  are  valid  for  an  image  (V  +  5)/2  prior  to  the 
most  recently  processed  image. 

In  all  our  experiments  reported  here  we  have  used  5  =  4. 

Using  this  algorithm  the  centers  of  expansion  are  estimated  at  pixels  243  and  274  in 
figures  8  and  9  respectively.  These  displays  of  edge  traces  are  from  a  stereo  pair  of  left 
and  right  cameras  mounted  on  our  mobile  robot.  The  cameras  are  576  pixels  wide  so  these 
estimates  say  that  each  camera  was  skewed  slightly  to  the  right,  with  the  left  one  skewed 
more  than  the  right.  This  corresponds  to  our  visual  estimate  from  examining  the  camera 
mounts  at  the  beginning  of  the  experimental  run.  Figures  6  and  7  show  the  grey  level 
images  from  which  these  edge  arrays  were  extracted. 

2.5  Approximation  errors 

It  is  worth  asking  how  well  this  scheme  does  for  computing  depth  even  with  synthetic  data. 
There  are  a  number  of  sources  of  inaccuracies;  all  edge  locations  are  approximated  to  an 
integral  number  of  pixels,  divisions  (such  as  in  equation  (4))  are  rounded  to  the  nearest 
integer,  and  the  edge  velocity  estimates  are  only  difference  approximations  to  derivatives. 
With  realistic  camera  parameters  and  using  V  —  4  and  5  =  4  we  have  experimentally  found 
accuracy  on  synthetic  edge  arrays  to  be  about  2%. 

3.  Calibrating  Stereo 

Recall  that  motion  vision  does  not  get  good  results  close  to  the  direction  of  motion  since 
estimates  for  R  are  necessarily  small  and  therefore  susceptible  to  noise  problems.  Also  it  is 
useless  when  the  robot  is  still,  or  not  moving  in  straight  lines.  Stereo  vision  suffers  from  none 
of  these  problems.  Our  stereo  algorithm  matches  edges  in  two  one  dimensional  images.  If 
all  the  camera  parameters  are  known  in  advance  we  could  then  compute  depth.  An  accurate 
model  of  the  camera  geometry  is  not  sufficient  because  stereo  is  very  susceptible  to  small 
camera  misalignments  and  so  the  cameras  must  be  repeatedly  calibrated. 

One  approach  to  this  problem  is  to  run  a  calibration  procedure  prior  to  running  the  robot 
(e.g.  Faugeras  and  Toscani  ( 1986)).  It  usually  involves  placing  the  robot  in  front  of  a  known 
test  pattern  and  examining  the  test  pattern  with  the  visual  system.  It  is  usually  necessary  to 
run  the  calibration  procedure  before  every  run  of  the  robot  because  of  mechanical  (thermal 
and  other)  drift  in  the  relative  position  of  the  optics  and  the  robot’s  drive  mechanism. 

A  second  approach  is  to  calibrate  from  actual  images  during  a  run.  Longuet-Higgins 
( 1981 )  shows  how  to  recover  all  camera  parameters  save  a  scale  factor  from  a  single  pair  of 


images.  However  we  also  need  to  know  the  parameters  of  the  cameras  relative  to  the  motion 
of  the  robot,  as  in  general  the  camera  platform  can  mechanically  drift  relative  to  the  drive 
platform.  Therefore  we  use  the  results  of  the  forward  motion  algorithms  to  calibrate  our 
cameras. 

3.1  The  algorithm 

We  use  the  same  edge  operator  as  for  forward  motion  and  apply  the  operator  to  each  of  the 
left  and  right  images. 

Matching  of  edges  in  the  two  images  is  accomplished  by  means  of  a  dynamic  programming 
algorithm  similar  to  that  developed  by  Ohta  and  Kanade  (1986)  for  two  dimensional  images 
and  by  Serey  and  Matthies  (1987)  for  single  scanline  images,  except  that  the  co6t  functions 
here  are  different. 

The  dynamic  programming  algorithm  searches  for  a  minimum  cost  path  left  to  right 
across  the  two  images  where  there  is  a  cost  associated  with  matching  two  edges,  and  a  cost 
with  skipping  an  edge  whether  it  is  due  to  noise,  occlusion,  uncovering  or  something  else. 

The  skipping  cost  is  a  constant  (2000  in  the  experiments  reported  in  this  paper). 

The  matching  cost  is  a  larger  number  (effectively  oc)  unless  the  signs  of  the  gradients  of 
the  two  edges  are  the  same  (i.e.,  unless  the  grey  levels  of  both  edges  go  from  dark  to  light 
or  vice  versa).  Otherwise  the  matching  cost  is  the  sum  of  squares  of  differences  of  pixel 
grey  levels  on  a  small  window  around  the  edge  (we  used  7  pixels  centered  on  the  edge,  in 
the  experiments  reported  here). 

3.2  What  needs  to  be  calibrated 

There  are  a  large  number  of  parameters  needed  to  describe  the  optical  system  used  for 
stereo  vision.  Each  of  the  two  cameras  has  a  focal  ratio,  and  each  has  six  positional  and 
rotational  degrees  of  freedom  relative  to  the  robot.  We  are  forced  to  calibrate  for  some 
of  these  parameters,  we  can  ignore  some  because  their  effects  are  too  small  to  notice,  and 
others  we  take  care  of  by  our  choice  of  stereo  algorithm,  camera  design,  and  methodology. 

Figure  10  illustrates  the  camera  geometry  we  are  assuming.  The  robot  coordinate  system 
has  the  2-axis  pointing  in  the  direction  of  travel  of  the  robot.  The  y  axis  is  vertical  and  the 
x  axis  is  perpendicular  to  the  direction  of  motion. 

We  assume  that  the  two  cameras  have  the  same  focal  ratio  /  or  field  of  view.2  Our 
forward  motion  algorithm  assumed  some  a  priori  knowledge  of  this  parameter  and  we  sug¬ 
gested  a  simple  way  to  determine  a  rough  estimate  for  it.  Rather  than  rely  on  such  an 
estimate  we  calibrate  the  stereo  system  assuming  no  previous  knowledge  of  /. 

Now  consider  the  geometric  degrees  of  freedom  and  refer  to  figure  10  for  definitions  of 
geometric  quantities. 

x  Our  calibration  assumes  the  base  line  separation  of  the  cameras  B  in  the  x  direction 
is  unknown. 

y  In  general  one  would  assume  this  parameter  to  be  small,  but  in  any  case  our  use 
of  the  average  of  16  lines  of  grey  level  data  in  the  current  experiments,  and  our 

3  It  is  possible  to  relax  this  assumption  but  the  introduction  of  the  necessary  extra  parameter  in 
our  calibration  equations  seems  to  make  the  calibration  less  stable.  We  do  not  have  a  theoretical 
basis  for  this  remark  only  empirical 
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intention  to  use  cylindrical  lenses  mean  that  larger  values  of  y  will  not  effect  the 
stereo  measurements  at  all. 

z  There  may  be  small  errors  in  mounting  the  cameras  which  lead  to  a  non-zero  Zo- 
We  show  below  that  for  small  zo  the  effects  are  minimal  and  can  be  ignored. 
pitch  Again  the  averaging  of  scanlines  or  the  use  of  cylindrical  lenses  takes  care  of  this 
parameter. 

roll  One  would  expect  this  parameter  to  be  small.  One  could  measure  the  fuzziness  of 
edges  (since  the  image  is  really  a  vertically  averaged,  digitally  or  optically,  image)  to 
estimate  the  amount  of  roll  present  in  the  cameras.  We  will  ignore  this  parameter. 
yaw  Each  camera  can  swing  from  side  to  side  on  a  standard  camera  mount,  and  in 
any  case  the  orientation  of  the  camera  could  easily  drift  mechanically  from  the 
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orientation  of  the  robot’s  wheels.  (It  certainly  does  on  our  robots.)  Our  calibration 
assumes  small  unknown  yaw  angles  on  each  camera. 


3.3  Calibration  formulation 

Figure  10  shows  a  view  from  above  of  a  stereo  pair  of  cameras  mounted  on  a  mobile  robot 
pointing  approximately  in  the  direction  of  motion.  We  assume  both  cameras  have  focal 
ratio  /.  The  optical  axis  of  the  left  camera  is  misaligned  by  angle  a  from  the  direction 
of  motion,  while  the  right  camera  is  misaligned  by  angle  /i.  The  baseline  separation,  B, 
is  measured  perpendicular  to  the  direction  of  motion.  20  is  the  misalignment  offset  of  the 
optical  centers  in  the  direction  of  motion. 

We  will  assume  that  none  of  these  five  parameters  (a,  /l,  f,B,  or  :0)  is  known  a  priori. 
We  will,  however,  assume  that  we  know  bounds  on  some  of  these  quantities.  We  assume 
that  a  and  (3  are  no  bigger  than±5°,  that  /,  in  pixels,  is  comparable  or  larger  than  P  the 
width  of  the  image  plane,  (499  compared  to  576  in  our  experiments)  and  that  zq  is  much 
smaller  than  B. 

The  quantities  we  can  directly  measure  given  some  matched  feature  in  the  two  images 
are  dx  and  d2,  the  distances  in  pixels  from  the  centers  of  view  of  the  left  and  right  cameras 
respectively.  Due  to  our  assumption  above  about  the  size  of  /,  we  know  that  roughly 

\d,\  <  t.  (8) 

The  quantity  we  wish  to  compute  given  some  matched  feature  in  the  two  images  is  its 
distance  z  in  the  direction  of  motion.  Figure  10  shows  that  2  is  measured  from  the  optical 
center  of  the  left  camera.  The  only  other  parameter  we  could  compute  is  x,  the  displacement 
of  the  feature  perpendicular  to  the  direction  of  motion.  We  use  x  in  deriving  equations  for 
2  but  do  not  compute  it  explicitly  in  any  of  our  experiments.  Given  an  estimate  of  /  it  can 
easily  be  recovered  for  use  in  obstacle  avoidance. 

Consider  the  quantities  labelled  a  and  6  in  the  figure.  They  both  relate  to  the  z  distance 
of  the  feature  from  the  left  camera.  We  can  write 

2  =a  —  6 
a  =zx  cos  a 
b  =xx  sin  a 

where  zx  is  the  distance  of  the  feature  in  the  direction  of  the  optical  axis  of  the  left  camera 
and  xx  is  the  feature’s  transverse  distance  from  that  axis.  Now,  because  we  are  using 
perspective  projection  we  can  write 

*t_  _  d± 

z\  f 

giving  us  an  expression  for  xx  in  terms  of  zx  and  so  we  can  rewrite  2  as 

■  di 

z  =  a  —  ft  =  2i(cos«  —  —  sin  a)  (9) 

which  is  a  linear  expression  in  zx.  We  can  get  a  similar  linear  expression  for  2  in  terms  of 
22  (it  involves  the  constant  offset  2o)  and  thus  get  a  linear  equation  relating  zx  and  z2. 

We  can  do  the  same  analysis  for  x  getting  a  second  simultaneous  linear  equation  relating 
2 1  and  22.  Solving  and  substituting  back  in  equation  (9)  we  obtain 

Pzo  +  i'B 


(10) 


17 


where 


N‘-V 


*v' 


p  —  -  /ri>  cos  a  cos  0  +  fd\  sin  a  sin  0  —  /2  cos  a  sin  /? 

+  did?  sin  a  cos  0 

v  —f2  cos  a  cos  0  +  didi  sin  a  sin  0  —  /<f2  cos  a  sin  /? 

-  fd\  sin  q  cos  0 

p  =(/2  +  dxd2 )  sin(a  -  3)  +  f(d\  -  d2)cos(a  -  0). 

We  now  have  an  equation  for  z,  the  quantity  we  wish  to  compute,  in  terms  of  d\  and  d2 
the  quantities  we  can  measure.  We  want  to  derive  a  calibration  procedure  to  identify  the 
unknown  parameters  in  this  equation.  We  begin  by  noting  that  we  can  safely  ignore  some 
terms. 

The  sines  of  a  and  0  are  both  less  than  0.1.  The  cosines  are  ail  greater  than  0.995. 
Thus  the  dominating  term  in  p  is  roughly  /d2  which  by  equation  (8)  is  less  than  half  the 
dominating  term  /2  in  u.  Since  we  also  are  assuming  that  zq  is  much  smaller  then  B  we 
will  simply  ignore  the  p  term.  We  can  also  ignore  all  but  the  /2  term  in  v  as  the  other 
three  are  each  at  least  20  times  smaller. 

Turning  attention  to  p  we  first  observe  that  (dx  —  d2)  can  be  arbitrarily  small,  and  that 
sin(a  -  0)  can  be  larger  than  0.17.  Thus  since  either  term  can  dominate  both  terms  are 
important.  The  only  remaining  question  is  whether  in  the  left  hand  term  we  can  ignore  d\di 
as  it  is  never  more  than  1/4  the  size  of  /2.  We  choose  to  ignore  it  for  now  in  our  derivation 
of  a  calibration  procedure,  but  later  we  compare  its  inclusion  and  exclusion  experimentally 
and  conclude  that  it  is  insignificant. 

With  these  approximations,  and  dividing  both  the  top  and  bottom  of  equation  (10) 
through  by  /cos(o  -  0)  we  get 

A 

(H) 


Z  - 


r  +  d\  —  «f2 

where  A  and  T  are  constants  to  be  determined  by  calibration. 

Suppose  we  have  a  collection  of  stereo  images  of  features  with  known  distance  z,  i.e.,  we 
have  a  collection  of  n  triples  (d^,d2j,Zj).  For  a  given  A  and  T,  we  could  write  as  a  measure 
of  closeness  of  fit  for  each  triple 


-  -  r  -du  +  d 


2  «• 


(12) 


This  is  both  linear  in  A  and  T  and  a  good  measure  of  error  in  the  image  plane. 

When  we  minimize  the  sum  of  squares  of  measure  ( 12)  for  n  calibration  triples  wc  obtain 

«E(4  -  d2 <)/*.  -  E (i/*i)E(dii  - d2i) 


A  = 


r  = 


«E(i/2:.2)  -  (El/**)2 

Ed/A)E(<(ii-4)Ai-E(i/^)(E<(ii-4) 


nEO/2.2)  ~  (E  l/*i)2 

where  all  the  sums  range  over  the  n  values  of  i. 


3.4  Calibration  procedure 


We  collect  triples  to  use  for  calibrating  the  stereo  system  by  using  forward  motion  analysis 
through  equation  (7)  to  get  depth  estimates  for  points  visible  in  both  the  left  camera  and 
right  camera.  For  ear'll  pair  of  images  we  know  the  pixel  coordinates  of  each  point  for  which 


Figure  11.  The  left  image  of  a  stero  pair  at  the  beginning  of  a  straight  line  motion  of  the  robot. 


we  have  a  depth  estimate.  We  run  the  stereo  algorithm  on  each  pair  of  images  and  for  each 
match  it  finds  we  check  to  see  whether  forward  motion  delivered  a  depth  estimate  for  both 
the  edge  in  the  left  image  and  the  edge  in  the  right  image.  If  so,  and  if  those  estimates  are 
close  (within  ±10%  of  their  average  in  our  experiments)  we  use  their  average  as  the  third 
element  of  a  triple  which  includes  the  left  and  right  edge  pixel  coordinates. 

When  a  few  tens  of  triples  have  been  collected  we  plug  them  into  the  least  squares 
formulation  to  get  estimates  of  A  and  T. 


3.5  Experimental  results 

To  test  the  preceeding  algorithms  we  set  up  our  robot  Allen  to  drive  in  an  approximately 
straight  line  with  large  objects  on  either  side  of  its  path.  Figures  11  and  12  are  the  left  and 
right  full  frame  camera  views  from  approximately  the  start  of  its  path. 

The  cameras  were  connected  via  cables  to  a  Lisp  machine  and  an  image  was  digitized 
every  1/15  of  a  second  giving  a  stereo  pair  every  2/15  of  a  second.  The  robot  was  moving  at 
approximately  1.5  feet  per  second.  Thus  the  contribution  to  z0  (in  figure  10)  due  to  robot 
motion  is  approximately  1.2  inches,  which  is  small  compared  to  our  stereo  baseline  B  of 
approximately  8  inches  and  justifies  our  disregard  of  p  in  equation  (10). 

Figures  6  and  7  show  the  left  and  right  grey  level  motion  images  collected  from  100 
stereo  pairs  of  standard  images.  Each  scanline  in  figures  6  and  7  corresponds  to  averaging 
16  scanlines  from  the  centers  of  the  original  images.  Time  flows  down  these  pseudo  images, 
and  is  approximately  13  seconds  from  top  to  bottom.  Figures  8  and  9  show  the  edges 
extracted  from  the  time  images  recall  that  the  edge  detection  is  one  dimensional  along 
scanlines. 


Figure  12.  The  right  image  of  a  stero  pair  at  the  beginning  of  a  straight  line  motion  of  the  robot. 


In  these  experiments  we  made  two  passes  over  the  edge  images.  The  first  was  to  estimate 
the  centers  of  expansion  and  the  second  to  extract  motion  depth  estimates  and  correlate 
those  with  the  stereo  matches  in  order  to  get  triples  for  stereo  calibration.  On  a  robot 
running  these  algorithms  continuously  there  would  only  be  one  pass,  simultaneously  making 
minor  refinements  to  both  the  centers  of  expansion  and  stereo  calibrations. 

In  the  100  image  pairs,  the  algorithm  found  94  estimates  for  C#  in  the  left  image, 
averaging  to  243,  and  110  estimates  in  the  right  image,  averaging  to  274.  A  total  of  92  stereo 
matches  were  found,  of  which  31  were  strongly  consistent  with  motion  depth  estimates. 
These  were  used  to  calibrate  the  stereo  system  resulting  in  estimates  of 


A  =  1958.8475,  T  =  56.970116. 


Figure  13  shows  the  original  left  and  right  averaged  depth  estimates  from  the  motion  algo¬ 
rithm  for  each  of  these  31  points,  along  with  the  estimate  produced  by  the  stereo  algorithm 
with  the  derived  calibration.  The  depth  estimates  are  in  units  of  time  between  collecting 
images.  Notice  that  the  difference  between  locations  of  matched  edges  in  left  and  right  im¬ 
ages  of  particular  edges  is  sometimes  positive  and  sometimes  negative  because  the  cameras 
are  not  aligned  to  be  parallel. 

To  test  the  stereo  calibration  we  took  four  feature  points  corresponding  to  known  objects 
in  the  scene.  We  estimated  where  the  robot  had  been  when  the  first  images  in  the  sequence 
were  taken  ( to  achieve  the  constant  velocity  constraint  the  robot  had  to  be  moving  before  we 
started  collecting  images)  and  measured  the  distances  to  the  known  points  in  the  direction  of 
motion.  We  substituted  the  left  and  right  image  edge  coordinates  into  equation  (12)  to  get 
time  to  collision  estimates,  and  divided  each  of  those  into  the  known  distances  (measured  in 
feet)  and  averaged  the  result  to  get  a  calibration  of  the  stereo  system  in  feet.  The  estimate 
is  0.1874  feet  per  stereo  pair  image  step,  giving  the  robot’s  velocity  at  1.406  feet  per  second. 
F  igure  14  shows  the  unsurprising  results  of  this  procedure  in  the  top  four  rows  of  the  table. 
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Figure  13.  Display  of  the  stereo  calibration  and  results.  Left  two  columns  are  pixel  coordinates  of 
edges  in  the  left  and  right  images,  third  column  is  their  difference,  fourth  column  is  the  average 
depth  estimate  from  the  motion  algorithms  running  on  the  left  and  right  image  sequences  and  the 
fifth  column  is  the  depth  estimate  produced  by  the  stereo  system  calibrated  with  this  data. 


We  then  took  the  same  four  features  and  searched  for  them  20  images  later  in  the  image 
sequences.  The  bottom  half  of  figure  14  shows  the  stereo  estimates  for  these  four  points. 
Ideally,  all  the  time  to  collision  estimates  should  be  reduced  by  20.  They  are  reduced  bv 
18,  14,  20,  and  17.  Note  that  the  third  of  these  points  appeared  roughly  in  the  centers 
of  the  two  images  and  no  time  to  collision  estimates  were  produced  for  it  by  the  motion 
algorithms.  The  fourth  column  shows  the  distance  in  feet  that  would  have  been  estimated  if 
the  stereo  had  come  up  with  a  decrease  of  exactly  20  units  for  each  of  the  points.  The  sixth 
column  shows  the  estimate  that  was  actually  obtained.  The  relative  distances  between  the 
points  known  a  priori  from  measurement  given  in  the  fourth  column  in  the  upper  part  of 
the  table,  and  the  estimated  distances  in  the  sixth  column  of  the  lower  part  of  the  table  are 
quite  good  and  can  be  summarized  as: 

exact  —  est 
5.0  -  5.6 

8.6  -  7.5 

2.0  —  1.3 

3.6  —  1.9 

7.0  -  6.9 

10.6  —  8.8 


Note  that  these  are  relative  distances,  some  of  them  at  large  distances  from  the  robot 
itself. 


173  179 
336  364 
273  305 
485  479 


Figure  15.  Same  as  for  figure  14  but  using  a  much  more  complex  stereo  calibration  procedure. 
There  is  almost  no  difference  in  results. 


3.6  Other  calibration  schemes 


In  experiments  with  synthetic  data  the  most  signifeant  term  omitted  from  equation  (10)  in 
equation  (11)  appeared  to  be  did2  in  the  denominator.  We  therefore  repeated  the  above 
experiment  using  a  calibration  equation  of  the  form 

r  +  fldjd2  +  d\  —  d2  ^ 

The  results  were  almost  identical: 

A  =  1957.6881,  T  =  56.864132,  =  2.360905  x  10"6, 

an  estimate  of  0.1869  feet  per  image  step  (within  0.3%  of  the  previous  estimate)  or  robot 
velocity  of  1.401  feet  per  second,  and  almost  identical  results  on  the  check  with  known 
distances  as  shown  in  figure  15. 

These  experiments  convinced  us  that  (11)  is  certainly  sufficient.  But  is  it  necessary?  To 
clmck  we  tried  calibrating  with  the  same  data  to  a  model  of  stereo  that  assumes  the  cameras 
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Figure  16.  Same  as  for  figure  14  but  using  a  stereo  calibration  that  assumes  the  cameras  are  aligned 
parallel  to  the  direction  of  motion.  The  results  are  meaningless. 

are  perfectly  aligned,  i.e.,  using  a  calibration  equation  of  the  form: 

<14) 

The  results  were: 

A  =  -1213.2903, 

an  estimate  of  0.3266  feet  per  image  step  or  robot  velocity  of  2.450  (almost  a  factor  of  2 
off),  and  meaningless  results  on  known  distances  as  shown  in  16.  Clearly  this  naive  model 
is  not  sufficient. 


4.  Conclusions 

In  this  paper  we  have  demonstrated  a  number  of  aspects  of  forward  motion  analysis  and 
stereo  vision. 

The  results  of  forward  motion  analysis  are  fairly  noisy  by  the  unnatural  standards  set  by 
most  workers  in  computer  vision  and  mobile  robots.  In  this  paper  we  first  argued  that  those 
standards  are  not  necessary,  and  in  fact  are  not  met  by  humans,  most  of  whom  operate 
perfectly  well  as  autonomous  mobile  agents.  In  fact  errors  of  ±10%  do  not  seem  to  large  to 
us  for  reliable  mobile  robot  obstacle  avoidance. 

Forward  motion  analysis  has  some  wonderful  independence  properties  and  only  requires  a 
small  number  of  16  bit  fixed  precision  arithmetic  operations  to  deliver  depth  in  a  coordinate 
system  natural  to  the  task  of  obstacle  avoidance.  At  the  same  time  as  it  is  delivering 
depth  estimates  it  can  continually  recalibrate  itself  for  camera  misalignment  relative  to 
the  direction  of  motion.  All  of  these  computations  could  easily  be  transfered  to  a  parallel 
network  of  simple  processors. 

Despite  the  local  noise  in  each  forward  motion  estimate  we  demonstrated  that  with  just 
a  few  tens  of  such  measurements  we  could  calibrate  a  stereo  camera  system  with  grossly 
misaligned  cameras. 

What  does  this  buy  us? 

It  means  we  can  have  a  vision  system  on  board  a  robot  that  doesn’t  require  a  time 
consuming  calibration  stage  at  the  beginning  of  an  experimental  run.  Rather  we  can  build 
power-up  and  go  systems.  If  the  sensor  platform  drifts  mechanically  from  the  drive  align¬ 
ment  (as  it  does  on  many  real  robots)  over  time  there  is  no  calibration  problem — the  al- 


goritlims  given  in  this  paper  continually  adjust.  If  there  is  more  severe  and  sudden  mis¬ 
alignment,  e.g.,  due  to  a  hard  collision,  the  robot  will  be  disoriented  for  only  a  few  seconds 
before  it  accommodates. 

More  than  this  however,  the  possibility  of  simple  fast  dynamic  calibration  to  the  world 
opens  up  the  possibility  of  having  cheap  (and  hence  mechanically  sloppy)  steerable  sensor 
platforms.  If  we  can  quickly  and  cheaply  compute  all  we  need  to  know  about  how  such  a 
platform  is  aligned  we  will  not  feel  pressure  to  spend  inordinate  amounts  of  money  building 
a  precise  platform. 

Silicon  is  getting  cheaper  a  lot  quicker  than  precision  machined  parts  are.  We  should 
search  for  ways,  as  this  paper  does,  of  trading  off  silicon  for  such  precision  machined  parts. 
If  the  analysis  and  algorithms  are  simple  we  have  a  better  chance  of  them  being  robust. 
Of  course  it  is  critical  to  back  up  analysis  with  experiment — analysis  in  computer  vision 
without  experiment  is  often  a  worthless  intellectual  pursuit. 
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Appendix — Kludge  Factors 

Almost  all  computer  vision  programs  have  a  large  number  of  “tweakable”  parameters  hidaen 
in  their  code.  These  are  usually  used  to  implement  certain  English  phrases  used  in  the 
scientific  paper  which  describes  the  work.  For  instance  there  might  be  a  phrase  like  “for 
sufficiently  close  edge  terminations  we  In  this  paper  we  have  tried  to  explicity  state 

all  such  kludge  factors  that  we  have  used.  For  the  reader’s  convenience  we  also  collect  them 
here  and  give  the  values  we  used  for  them  throughout  all  experiments  we  have  done. 

The  edge  mask  (derivative  of  a  Gaussian)  is: 

1  3  5  9  14  18  20  18  11  0  -  11  -  18  -  20  -  18  -  14  -  9  -  5  -  3  -  1. 

The  threshold  strength  we  demand  of  edges  is  500.  The  same  edges  are  used  for  stereo 
and  motion  algorithms. 

In  the  dynamic  programming  portion  of  the  stereo  algorithm  we  assign  a  cost  of  2000 
to  skipped  matches,  and  the  sum  of  squares  of  pixel  grey  level  differences  in  a  window  of  7 
pixels  centered  on  the  left  and  right  edges  for  accepted  matches. 

In  linking  edges  in  the  motion  algorithm  we  start  off  with  a  search  window  of  ±3  pixels 
and  later  narrow  that  to  -1  to  +3  pixels  from  the  most  recent  step. 

To  compute  the  edge  velocity  in  the  motion  algorithms  we  use  the  slope  of  a  chord  joining 
edge  locations  V  =  4  images  apart.  In  estimating  Ce  we  do  this  twice  at  3V  =  12  images 
apart. 

We  smooth  the  time  to  collision  estimates  over  S  -  4  time  intervals  (i.e.,  5  images). 

In  deciding  on  consistent  depth  estimates  and  stereo  matches  we  demand  that  the  left 
and  right  motion  depth  estimates  lie  within  ±10%  of  their  average. 


